Goto

Collaborating Authors

 variable importance



From global to local MDI variable importances for random forests and when they are Shapley values

Neural Information Processing Systems

Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per dataset) level on the relevance of input variables to predict a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions. Then, we derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification problems.






Statistical Inference and Learning for Shapley Additive Explanations (SHAP)

arXiv.org Machine Learning

The SHAP (short for Shapley additive explanation) framework has become an essential tool for attributing importance to variables in predictive tasks. In model-agnostic settings, SHAP uses the concept of Shapley values from cooperative game theory to fairly allocate credit to the features in a vector $X$ based on their contribution to an outcome $Y$. While the explanations offered by SHAP are local by nature, learners often need global measures of feature importance in order to improve model explainability and perform feature selection. The most common approach for converting these local explanations into global ones is to compute either the mean absolute SHAP or mean squared SHAP. However, despite their ubiquity, there do not exist approaches for performing statistical inference on these quantities. In this paper, we take a semi-parametric approach for calibrating confidence in estimates of the $p$th powers of Shapley additive explanations. We show that, by treating the SHAP curve as a nuisance function that must be estimated from data, one can reliably construct asymptotically normal estimates of the $p$th powers of SHAP. When $p \geq 2$, we show a de-biased estimator that combines U-statistics with Neyman orthogonal scores for functionals of nested regressions is asymptotically normal. When $1 \leq p < 2$ (and the hence target parameter is not twice differentiable), we construct de-biased U-statistics for a smoothed alternative. In particular, we show how to carefully tune the temperature parameter of the smoothing function in order to obtain inference for the true, unsmoothed $p$th power. We complement these results by presenting a Neyman orthogonal loss that can be used to learn the SHAP curve via empirical risk minimization and discussing excess risk guarantees for commonly used function classes.



Asymptotic Theory and Phase Transitions for Variable Importance in Quantile Regression Forests

arXiv.org Machine Learning

Quantile Regression Forests (QRF) are widely used for non-parametric conditional quantile estimation, yet statistical inference for variable importance measures remains challenging due to the non-smoothness of the loss function and the complex bias-variance trade-off. In this paper, we develop a asymptotic theory for variable importance defined as the difference in pinball loss risks. We first establish the asymptotic normality of the QRF estimator by handling the non-differentiable pinball loss via Knight's identity. Second, we uncover a "phase transition" phenomenon governed by the subsampling rate $β$ (where $s \asymp n^β$). We prove that in the bias-dominated regime ($β\ge 1/2$), which corresponds to large subsample sizes typically favored in practice to maximize predictive accuracy, standard inference breaks down as the estimator converges to a deterministic bias constant rather than a zero-mean normal distribution. Finally, we derive the explicit analytic form of this asymptotic bias and discuss the theoretical feasibility of restoring valid inference via analytic bias correction. Our results highlight a fundamental trade-off between predictive performance and inferential validity, providing a theoretical foundation for understanding the intrinsic limitations of random forest inference in high-dimensional settings.


Variable Importance Using Decision Trees

Neural Information Processing Systems

Decision trees and random forests are well established models that not only offer good predictive performance, but also provide rich feature importance information. While practitioners often employ variable importance methods that rely on this impurity-based information, these methods remain poorly characterized from a theoretical perspective. We provide novel insights into the performance of these methods by deriving finite sample performance guarantees in a high-dimensional setting under various modeling assumptions. We further demonstrate the effectiveness of these impurity-based methods via an extensive set of simulations.